Improving Robustness of Foundation Models in Domain Adaptation with Soup-Adapters

Roschkowski, Marco

arXiv.org Artificial Intelligence

Computer vision has seen tremendous progress due to the emergence of deep learning technologies. Large supervised benchmark datasets such as ImageNet (Deng et al. 2009) have enabled several methodological breakthroughs: surpassing traditional computer vision methods (Krizhevsky et al. 2012), the introduction of skip connections (He et al. 2016), advanced architectures such as inverted bottlenecks (Sandler et al. 2018), and improved scaling techniques (Koonce and Koonce 2021). A long-standing limitation has been the dependence on such large curated datasets, which are expensive to obtain. Recently, the paradigm of foundation models has become an attractive alternative, in which a single model is trained on a corpus of data large enough to generalize well across several distinct downstream tasks. One notable vision foundation model is CLIP (Radford et al. 2021), which learns a joint embedding space of images and their corresponding captions; this architecture can naturally perform zero-shot classification by describing visual categories via text prompts. Another popular foundation model is DINOv2 (Oquab et al. 2023), which was trained on a large curated corpus of images to produce robust features. These models can easily be adapted for few-shot learning using KNN evaluation or prototypical learning (Snell et al. 2017).
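The prototypical-learning adaptation mentioned above is simple enough to sketch: given frozen foundation-model features, each class prototype is the mean of its support embeddings, and queries are assigned to the nearest prototype by cosine similarity. The sketch below (function names are illustrative, not from the paper) assumes features are already extracted as NumPy arrays:

```python
import numpy as np

def prototypical_classify(support_feats, support_labels, query_feats):
    """Assign each query to the class whose prototype (the mean of that
    class's support embeddings) is most similar in cosine distance."""
    def normalize(x):
        # L2-normalize rows so dot products act as cosine similarity.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    support = normalize(np.asarray(support_feats, dtype=float))
    query = normalize(np.asarray(query_feats, dtype=float))
    classes = np.unique(support_labels)
    # One prototype per class: the (re-normalized) mean support embedding.
    protos = normalize(np.stack(
        [support[support_labels == c].mean(axis=0) for c in classes]))
    sims = query @ protos.T  # cosine similarity of each query to each prototype
    return classes[np.argmax(sims, axis=1)]
```

With a CLIP or DINOv2 backbone, `support_feats` and `query_feats` would come from the frozen image encoder; no gradient updates are needed, which is what makes this an attractive few-shot baseline.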


Radial Networks: Dynamic Layer Routing for High-Performance Large Language Models

Dotzel, Jordan, Akhauri, Yash, AbouElhamayed, Ahmed S., Jiang, Carly, Abdelfattah, Mohamed, Zhang, Zhiru

arXiv.org Artificial Intelligence

Large language models (LLMs) often struggle with strict memory, latency, and power demands. To meet these demands, various forms of dynamic sparsity have been proposed that reduce compute on an input-by-input basis. These approaches improve over static methods by exploiting the variance across individual inputs, which has steadily grown with the exponential increase in training data. Meanwhile, the increasing depth of modern models, currently hundreds of layers, has opened opportunities for dynamic layer sparsity, which skips the computation of entire layers. In this work, we explore the practicality of layer sparsity by profiling residual connections and establish the relationship between model depth and layer sparsity. For example, the residual blocks in the OPT-66B model have a median contribution of 5% to its output. We then take advantage of this dynamic sparsity and propose Radial Networks, which perform token-level routing between layers guided by a trained router module. These networks can be obtained by post-training distillation from sequential networks or trained from scratch to co-learn the router and layer weights. They enable scaling to larger model sizes by decoupling the number of layers from the dynamic depth of the network, and their design allows for layer reuse. By varying the compute token by token, they reduce the overall resources needed for generating entire sequences. Overall, this leads to larger-capacity networks with significantly lower compute and serving costs for large language models.
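The two ideas in the abstract, profiling how much a residual block contributes to the hidden state and routing tokens past low-contribution layers, can be sketched in a few lines. This is a toy illustration under my own assumptions (the function names and the threshold-based router are hypothetical; the paper's actual router is a trained module):

```python
import numpy as np

def residual_contribution(x, block_out):
    """Contribution of a residual block f to the updated hidden state,
    measured as ||f(x)|| / ||x + f(x)||. A small ratio (e.g. the 5%
    median reported for OPT-66B) suggests the block can often be skipped."""
    return np.linalg.norm(block_out) / np.linalg.norm(x + block_out)

def forward_with_layer_skipping(x, blocks, router, threshold=0.5):
    """Toy dynamic-depth forward pass: a router scores each residual
    block for the current hidden state, and blocks scoring below the
    threshold are skipped entirely (the state passes through unchanged)."""
    for i, block in enumerate(blocks):
        if router(x, i) >= threshold:
            x = x + block(x)  # execute the residual block
        # else: skip; x flows through the residual connection untouched
    return x
```

A production version would route per token inside a batched transformer and would learn the router jointly with (or distill it from) the layer weights; the point here is only that skipping a layer is a no-op on the residual stream, which is what makes layer sparsity cheap to exploit.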